-
Notifications
You must be signed in to change notification settings - Fork 828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for building an AddedVocabulary
based on a pre-existing AddedVocabulary
.
#1444
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
That looks nice I'll have a look! Thanks for opening it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I just want to make sure this is actually needed:
Consider for example the Yi tokenizer. If one tries to load a fast tokenizer for Yi, then you will end up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the added_tokens_decoder field. These IDs end up getting ignored when building the fast tokenizer due to the code block starting here. This is also evident from the recommendation of the Yi model authors to set use_fast=False as shown here.
If the EOS token and BOS token are part of the vocab, they have and should be associated with this ID. I am not sure I follow what the issue is here
@ArthurZucker what happens with the Yi tokenizer is the following. The config of the tokenizer includes this field:
This means that the tokenizer uses ID 1 for the special So, when this tokenizer is loaded, it hits this code block. If you read that code, you'll notice that it takes the information included in the Instead, the recommendation when using the Yi model is to set use_fast=False as shown here to avoid this issue. This PR adds functionality that will enable me to fix the underlying behavior in the linked code block from the |
I understand your requirements but that is not the wait to go with def vocab(self, proto):
vocab = [
("<unk>", 0.0),
("<s>", 0.0),
("</s>", 0.0),
]
vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
return vocab that is hard coded in the We should modify the WDYT? |
Hmm so let me take a step back and summarize the changes here and also try to understand your suggested fix. First of all, I assume the changes introduced in #1443 are uncontroversial, right? Because that's just adding some functions that appear to be missing in the first place. If so, could you please approve that one so we can decouple the changes? Regarding the main change introduced in this PR (i.e.,
Could you please elaborate? Also, how does it relate to this code block that I linked earlier which seems broken for Yi (due to the reuse of pre-existing token IDs as I described earlier). |
Of course! |
@ArthurZucker thanks! In that case, I can look into that fix for |
huggingface/transformers#29797 is probably what you are looking for! |
@ArthurZucker that's great, thanks! |
(This PR is stacked on top of #1443 and so currently shows the changes for both PRs together)
The changes introduced in this PR are necessary if one wants to maintain the token IDs of the pre-existing
AddedVocabulary
and more importantly, they are necessary if one wants to build a fast tokenizer correctly based on a Python tokenizer.Consider for example the Yi tokenizer. If one tries to load a fast tokenizer for Yi, then you will end up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the
added_tokens_decoder
field. These IDs end up getting ignored when building the fast tokenizer due to the code block starting here. This is also evident from the recommendation of the Yi model authors to setuse_fast=False
as shown here.That's at least partially because this library does not support building tokenizers using a pre-existing
AddedVocabulary
. This PR adds support for that. I plan to follow up with a PR in thetransformers
library after this PR is merged, updating that code block.